
# import libraries
# ARIMA libraries
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.arima_model import ARIMAResults
from statsmodels.tsa.stattools import adfuller
import warnings
# Global map visualization
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import brewer
# Regular libraries
from datetime import datetime, timedelta
import dateutil.parser
import geopandas as gpd
import glob
from IPython.display import Image
import json
import math
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12,8)
import numpy as np
import pandas as pd
import seaborn as sns
Covid-19 is a global pandemic. We have over 150 countries affected and many approaches to the pandemic. It is possible to use data science to evaluate the leadership of countries by using mortality as a litmus test. This notebook collects the data necessary for that, graphically shows the results for those countries, and then runs a machine learning model to predict the total mortality by the end of 2020 (global, not by country).
We will use deaths per million population and median income as our metrics. An extremely small country could have a relatively small number of deaths while being totally incompetent at managing Covid-19, whereas a large country (by population) could do a great job and still have a large number of deaths. We have also included the Transparency International ranking, because many countries are manipulating their data.
Flattening will be defined using the number of New_Cases per day over the last 14 days. We take the last date for which data is available and see whether the number of New_Cases is going down. We will do this with a scatter plot and a regression line fitted to the data.
The columns change over time, so I need to figure out what the superset of these column names is. We use glob, sets, and a loop to create this result.
path = r'C:\Users\linds\OneDrive\mystuff\GitHub\COVID-19\csse_covid_19_data\csse_covid_19_daily_reports'
all_files = glob.glob(path + "/*.csv")
li_set = set()
for filename in all_files:
    df = pd.read_csv(filename)
    cols = set(df.columns)
    li_set = cols.union(li_set)
print(li_set)
# We want the LAST df's column names
all_df_cols = df.columns
all_df_cols
li_set is the superset of all column names, so it is NOT the same as the last csv. Some renaming is needed for every csv that comes in. Now that we know which column names to change, we can bring in the entire set of files and concatenate them.
def short_df(df, temp_df):
    # Pandas really does not like '/' or ' ' in a column name so ... time to rename
    df.rename(columns={'Province/State': 'Province_State',
                       'Country/Region': 'Country_Region',
                       'Last Update': 'Last_Update',
                       'Latitude': 'Lat',
                       'Longitude': 'Long_'}, inplace=True)
    # Get the current df's columns
    cols = df.columns
    # Copy the columns that do exist into temp_df
    for col in cols:
        temp_df[col] = df[col]
    return temp_df
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    # Read in the next csv
    df = pd.read_csv(filename)
    # Get the date of the file from its name
    date_string = filename[-14:-4]
    # Insert the date_string as the first column of the df
    df.insert(0, "Date_", date_string)
    # If it is a full df then just append it
    if df.shape[1] == 13:
        li.append(df)
    else:
        # Need to build a df with the same columns as the largest (last) df
        # Create a temp_df to hold everything
        temp_df = pd.DataFrame(data=np.nan, columns=all_df_cols.insert(0, 'Date_'), index=df.index)
        # Call function to create the short_df
        df = short_df(df, temp_df)
        # append df to li
        li.append(df)
all_df = pd.concat(li)
all_df.shape
# reset the index
all_df.reset_index(inplace=True, drop=True)
# Convert Date_ to datetime format
all_df['Date_'] = pd.to_datetime(all_df['Date_'], infer_datetime_format=True)
all_df.head()
Now we need population and median income data. This has been placed on disk in the local directory.
# Read in population and income data
pop_med_income_df = pd.read_csv(r'data/pop_med_income.csv')
pop_med_income_df.rename({"MedianPerCapitaIncome": 'MPC_Inc'}, axis=1, inplace=True)
pop_med_income_df.head()
pop_med_income_df.dtypes
I had an existing data file originally obtained from the World Bank (https://wits.worldbank.org/wits/wits/witshelp/content/codes/country_codes.htm). I have added items to it since then, because the names of the countries keep changing across the different datasets that I encounter. Eventually this should be bulletproof. Some data munging was required here, since the Johns Hopkins data does not have an ISO-standard country column.
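Since country names rarely match across sources, a small rename map applied before each merge is one way to keep the joins from silently dropping rows. This is only a sketch: the NAME_FIXES entries below are illustrative examples of the kind of variants that show up, not the actual contents of my edited file.

```python
import pandas as pd

# Illustrative name variants; the real fixes live in country_codes_edited.csv
NAME_FIXES = {
    'Korea, South': 'South Korea',
    'US': 'United States',
    'Taiwan*': 'Taiwan',
}

def harmonize_countries(df, col='Country_Region', fixes=None):
    """Return a copy of df with known country-name variants replaced."""
    out = df.copy()
    out[col] = out[col].replace(fixes or NAME_FIXES)
    return out

demo = pd.DataFrame({'Country_Region': ['Korea, South', 'Canada', 'US']})
print(harmonize_countries(demo)['Country_Region'].tolist())
```

Applying this kind of mapping before every merge keeps the inner joins from quietly losing countries whose names differ between files.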
# Now we want to add a country code
country_codes_df = pd.read_csv(r'data/country_codes_edited.csv')
country_codes_df.head()
# Let's rename the 3-letter code column
country_codes_df.rename({'ISO3166-1-Alpha-3': 'Alpha_3'}, axis=1, inplace =True)
country_codes_df.head(1)
# We only want Country_Region and Alpha_3
country_codes_df = country_codes_df[['Country_Region', 'Alpha_3']]
country_codes_df.head(1)
One of the huge issues with the Covid-19 data is that there are lots of politics being played with the numbers by politicians. We will use the 2019 Transparency International data (https://www.transparency.org/cpi2019) to give us an inkling of whether or not we can trust the numbers.
trans_int_df = pd.read_csv(r'data/transparency_international.csv')
trans_int_df.head()
# Merge the two dfs (country_codes_df and pop_med_income_df)
pop_inc_cc_codes = country_codes_df.merge(pop_med_income_df)
pop_inc_cc_codes.head()
# Now merge in the Transparency International data (trans_int_df)
pop_inc_cc_codes = pop_inc_cc_codes.merge(trans_int_df)
pop_inc_cc_codes.head()
We will merge this dataframe with the all_df after we have done some aggregations with all_df. However, right now we need to go to the next step which is to describe the data.
all_df.describe()
The above are all of the numeric columns. For our analysis we can drop FIPS, Recovered, and Active. We will keep the balance of the numeric columns. We have put a record of this in the todos in the Clean section.
all_df.info()
Check for duplicates now.
# Test
all_df.duplicated().sum()
Excellent no duplicates!
We will also drop Admin2, Province_State, Last_Update, and Combined_Key. We are doing everything with Country. We have put a record of this in the todos in the Clean section. However, in order to use our time efficiently as we are describing the data, we will drop these columns now. There is no sense describing columns that you are dropping.
# Code
# Make a copy to avoid the inevitable Pandas warnings.
slim_df = all_df.copy(deep=True)
# Just keep the columns we want.
slim_df = slim_df[['Date_', 'Country_Region', 'Lat', 'Long_', 'Confirmed', 'Deaths']]
# Test
slim_df.head(1)
# Null status
slim_df.isnull().mean()
Add backward fill to the Clean section. There are a few missing values, and we need to deal with them now so that we can look at the data distribution.
# Code
slim_df = slim_df.bfill()
# Test
# Null status
slim_df.isnull().mean()
Excellent no nulls!
slim_df.describe()
As you can tell from the .describe() output, this is essentially a dataframe of outliers. We are just trying to see the spread, so we can be creative.
slim_df.info()
# Get the numeric columns
cols = slim_df.columns
cols = cols[4:]
cols
# Drop any row that has a zero in one of the numeric columns
no_zeros_df = slim_df[(slim_df[cols].T != 0).all()]
print(no_zeros_df.shape)
no_zeros_df
Now we can run some boxplots (below) on a log scale, since we will never be taking the log of zero.
no_zeros_df.describe()
no_zeros_df.hist();
# Lets create a log10 of Confirmed and Deaths
log10_df = no_zeros_df[cols]
log10_df = np.log10(log10_df)
# Now plot Confirmed
plt.figure(figsize=(6, 4))
sns.boxplot(x=log10_df['Confirmed'])
# Now plot Deaths
plt.figure(figsize=(6, 4))
sns.boxplot(x=log10_df['Deaths'])
As expected, no real insightful information to be gained from this.
all_df
country_codes_df
pop_med_income_df
trans_int_df
As we go through this notebook we will be transforming the data so that we can answer the questions posed. To start, let's look at some global visualizations.
def plot_confirmed_death(df, place):
    """Plots daily running total of confirmed cases and deaths"""
    start = df.index[0].strftime('%Y-%m-%d')
    end = df.index[-1].strftime('%Y-%m-%d')
    title = 'From {} to {} Daily Running Total Statistics For Covid-19 ({})'.format(start, end, place)
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('# Of Cases and Deaths Per Day')
    plt.plot('Confirmed', data=df, color='orange', linewidth=2, label='Confirmed Cases')
    plt.plot('Deaths', data=df, linewidth=2, label='Deaths')
    plt.legend()
    plt.savefig(r'pics_final/running_totals.png');
def plot_mortality_rate(df, place):
    """Mortality rate plotted"""
    start = df.index[0].strftime('%Y-%m-%d')
    end = df.index[-1].strftime('%Y-%m-%d')
    title = 'From {} to {} Change in Mortality For Covid-19 ({})'.format(start, end, place)
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Mortality Rate')
    plt.plot('Mortality_Rate', data=df, linewidth=2, label='Deaths')
    plt.legend()
    plt.savefig(r'pics_final/mortality_rate.png');
def mortality_population(df, sign):
    """Plot Deaths_e6, which is adjusted for population. It shows which countries are doing well or poorly
    considering their Median Income and Transparency International score"""
    sns.set_color_codes("pastel")
    sns.barplot(x="Deaths_e6", y="Country_Region", data=df,
                label="Deaths Per Million Population", palette="Blues_d")
    # Title (end is the global last-date string set elsewhere in the notebook)
    title = 'Mortality Rate Per Million Population As of {}'.format(end)
    plt.xlabel('Deaths Per Million')
    plt.title(title)
    # Construct legend
    legend_ = (f'Includes Only Countries With {sign} 50% Median Per Capita Income and '
               '>= 60 on Transparency International Score')
    plt.legend([legend_])
    # Final housekeeping
    sns.despine(left=True, bottom=True)
    plt.savefig(r'pics_final/honest_but_no_money.png');
def regression_new_cases(df, place, days):
    """Number of New_Cases with Regression Line"""
    # Create the title
    end = df['Date_'].iloc[-1].strftime('%Y-%m-%d')
    title = ('New Cases For Last {} Days Prior To And Including {} For Covid-19 With Regression Line ({})'
             .format(days, end, place))
    plt.title(title)
    # We want the dates in a list so we can place them on the x axis
    new_labels = df['Date_'].dt.strftime('%m-%d').tolist()
    plt.xticks(np.arange(len(new_labels)), new_labels)
    # Label the axes
    plt.xlabel('Consecutive Days')
    plt.ylabel('New_Cases Daily')
    # Create the plot
    sns.regplot(x=range(len(df)), y='New_Cases', data=df)
    plt.savefig(r'pics_final/regresion_mortality_rate.jpg');
def regression_deaths(df, place):
    """Number of Deaths with Regression Line"""
    # Get start and end times of df
    start = df.index[0].strftime('%Y-%m-%d')
    end = df.index[-1].strftime('%Y-%m-%d')
    # Format the title and labels
    title = 'From {} to {} Daily Mortality For Covid-19 ({})'.format(start, end, place)
    plt.title(title)
    plt.xlabel('Days')
    plt.ylabel('# Of Deaths Per Day')
    # Plot regression line
    sns.regplot(x=range(len(df)), y='Deaths', data=df, label='Deaths')
    plt.savefig(r'pics_final/running_totals.png');
def plot_world_map(merged, values):
    """Maps features onto a global map for visualization purposes"""
    # unpack values
    feature, legend_string, title_string = values
    # Read data to json.
    merged_json = json.loads(merged.to_json())
    # Convert to String like object.
    json_data = json.dumps(merged_json)
    # Input GeoJSON source that contains features for plotting.
    geosource = GeoJSONDataSource(geojson=json_data)
    # Define a sequential multi-hue color palette.
    palette = brewer['YlGnBu'][8]
    # Reverse color order so that dark blue is the highest value.
    palette = palette[::-1]
    # Instantiate LinearColorMapper that linearly maps numbers in a range into a sequence of colors.
    color_mapper = LinearColorMapper(palette=palette,
                                     low=0,
                                     high=1,
                                     nan_color='#d9d9d9')
    # Define custom tick labels for color bar.
    tick_labels = {'0': legend_string}
    # Create color bar.
    color_bar = ColorBar(color_mapper=color_mapper,
                         label_standoff=8,
                         width=500, height=20,
                         border_line_color=None,
                         location=(0, 0),
                         orientation='horizontal',
                         major_label_overrides=tick_labels)
    # Create figure object.
    p = figure(title=title_string,
               plot_height=600,
               plot_width=950,
               toolbar_location=None)
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    # Add patch renderer to figure.
    p.patches('xs', 'ys', source=geosource, fill_color={'field': feature, 'transform': color_mapper},
              line_color='black', line_width=0.25, fill_alpha=1)
    # Specify figure layout.
    p.add_layout(color_bar, 'below')
    # Display figure inline in Jupyter Notebook.
    output_notebook()
    # Display figure.
    show(p)
def calc_angle(df, a):
    """Calculates the angle of the regression line"""
    # a is the height (deaths axis) read off the regression line of April 30 2020. This may change (obviously)
    b = (df.index[-1] - df.index[0]).days  # Length of the dataframe (x axis / adjacent side).
    # The arctangent of opposite/adjacent gives the angle in radians
    angle = math.degrees(math.atan(a / b))
    return angle
def calc_deaths(angle, start, end):
    """Calculates the number of deaths based on the angle generated from calc_angle and days calculated from dates"""
    # Calculate the number of days for b (x axis / adjacent)
    start = dateutil.parser.parse(start)
    end = dateutil.parser.parse(end)
    days_ = (end - start).days
    # Calculate the number of deaths (height / opposite)
    deaths = days_ * math.tan(math.radians(angle))
    return int(deaths), days_
def place_value(number):
    """Formats a number with ',' thousands separators for readability"""
    return "{:,}".format(number)
# Plot global mortality rate
global_df = slim_df.groupby('Date_').sum()
global_df = global_df[['Confirmed', 'Deaths']]
global_df['Mortality_Rate'] = global_df['Deaths'] / global_df['Confirmed']
# Clip outliers for mortality
global_df = global_df[global_df['Mortality_Rate'] <= .1]
plot_mortality_rate(global_df, 'Global')
It is extremely likely that the number of cases is at least 10 times higher than the Confirmed counts. If so, the true mortality rate is likely about 1/10 of the number shown here.
global_df.describe()
# Plot Confirmed and Deaths in one plot
plot_confirmed_death(global_df, 'Global')
# Run plot for New York
new_york = all_df[all_df['Admin2'] == 'New York City'].copy(deep=True)
new_york.set_index('Date_', drop=True, inplace=True)
plot_confirmed_death(new_york, 'New York')
These are both VERY scary plots. They do not show any abatement or signs of flattening. If anything they have gone exponentially higher.
Who are the best and worst countries dealing with Covid-19?
We need a dataframe that has these columns, but ONLY for the last date in the Covid-19 data. Then we do a groupby of that dataframe, and to the result we append Deaths_e6, Confirmed_e6, Median_Income, and the Transparency International score.
# Now merge all_df with country_codes_df.
# country_codes_df already uses Country_Region as its key, which makes the inner join simple
covid_pop_inc_df = all_df.merge(country_codes_df)
covid_pop_inc_df.head()
len(covid_pop_inc_df)
# Lets put covid_pop_inc_df on a diet
covid_pop_inc_df = covid_pop_inc_df[['Date_', 'Alpha_3', 'Country_Region', 'Lat', 'Long_', 'Confirmed', 'Deaths']]
covid_pop_inc_df.head()
pop_inc_cc_codes.head(1)
pop_inc_cc_codes.dtypes
covid_pop_inc_df.dtypes
covid_pop_inc_df.head(1)
q1_df = covid_pop_inc_df.groupby(['Date_', 'Alpha_3'], as_index=False)[['Confirmed', 'Deaths']].sum()
end = q1_df.iloc[-1,0].strftime('%Y-%m-%d')
end
q1_df = q1_df[q1_df['Date_'] == end]
print(q1_df.shape)
q1_df.head(1)
# Now we want to merge the population and median income data with this.
q1_df = q1_df.merge(pop_inc_cc_codes)
q1_df.head(1)
# Now convert Confirmed and Deaths to a rate per million population
q1_df['Confirmed_e6'] = q1_df['Confirmed'] / (q1_df['Pop2020'] / 1000000)
q1_df['Deaths_e6'] = q1_df['Deaths'] / (q1_df['Pop2020'] / 1000000)
q1_df.head(1)
# Make a copy of this df
q1_df_c = q1_df.copy(deep=True)
We do not think the reporting is correct for countries that are very poor. We are restricting our analysis to the top 50% of all countries by MedianPerCapitaIncome, AND the country must have a Transparency International rating of 60 or over. The USA has a rating of 69.
q1_df = q1_df[q1_df['TI_2019'] >= 60]
q1_median_income = int(q1_df['MPC_Inc'].median())
q1_df = q1_df[q1_df['MPC_Inc'] >= q1_median_income]
q1_df.sort_values('Deaths_e6', inplace=True)
print(q1_df.shape)
q1_df.head()
sign = '>='
mortality_population(q1_df, sign)
How many unique Alpha_3 codes (Country Codes) do we have in this dataset?
q1_df['Alpha_3'].nunique()
The above graph shows who is performing best and worst in the Covid-19 pandemic. The only countries included in this bar graph have a Transparency International score >= 60 and a Median Per Capita Income in the top 50% of the world's countries. Countries with the shortest bars are doing better. Remember, this IS adjusted for population, so the fact that New Zealand is right at the top is remarkable. We should be looking at Australia and New Zealand to understand how they achieved this.
The USA is in the bottom third and the UK is at the very bottom. Perhaps a wake-up call for their populations? The universe of countries that I chose was severely restricted by the above cutoffs. As a result, I feel this is an honest evaluation of the facts.
One sobering thought: there are 203 unique Alpha_3 codes in this dataset, corresponding to 203 countries, and we are only looking at 28 of them. This is because of the concern over the honesty of the data. In other words, the challenge we are looking at here is only the tip of the iceberg. NB there are 195 countries in the world (https://www.thoughtco.com/number-of-countries-in-the-world-1433445); the Johns Hopkins data has odd Country_Region values such as cruise ships.
I also looked at only countries that are "poor" but "honest" below.
# Let's see how the other half lives :)
no_money_but_honest = q1_df_c[q1_df_c['TI_2019'] >= 60]
no_money_but_honest = no_money_but_honest[no_money_but_honest['MPC_Inc'] < q1_median_income]
no_money_but_honest.sort_values('Deaths_e6', inplace=True)
# Call the plotting function
sign = '<'
mortality_population(no_money_but_honest, sign)
print(no_money_but_honest['Country_Region'].tolist())
There are no obvious takeaways from this list. Taiwan is an island and was very aware of the dangers due to its intimate relationship with China. Botswana likely does not have the ability to count deaths; that sounds harsh, but with no national health system, people are probably just dying at home with no testing. Japan is highly sophisticated, so it is surprising to see it on this list as having a lower Median Per Capita Income; it just missed the cutoff. Anyway, no obvious takeaways from this one.
Which countries have flattened or are flattening the curve?
We will take our universe of countries from a set that is twice as large as Q1 above and see if they are experiencing a flattening of the curve. The metric will be the last 14 days with a regression line. If the regression line is flat or going down, they are flattening the curve.
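The classification used later in this notebook is a crude half-window mean comparison; an alternative sketch is to test the sign of a least-squares slope fitted over the 14-day window. The trend_slope helper and the synthetic series below are illustrative, not part of the notebook's data.

```python
import numpy as np

def trend_slope(new_cases):
    """Least-squares slope of daily new cases over the window."""
    x = np.arange(len(new_cases))
    slope, _intercept = np.polyfit(x, new_cases, 1)
    return slope

# Two synthetic 14-day windows: one flattening, one still rising
falling = [100, 90, 85, 70, 65, 50, 40, 35, 30, 20, 18, 15, 10, 5]
rising = [5, 10, 12, 20, 30, 35, 45, 50, 60, 70, 80, 85, 95, 100]
print(trend_slope(falling), trend_slope(rising))
```

A country would be classified as flattening when the slope is zero or negative, which matches the visual reading of the regression plots.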
# groupby that creates a sum for each date for each country (Alpha_3).
q2_df = covid_pop_inc_df.groupby(['Date_', 'Alpha_3'], as_index=False)[['Confirmed', 'Deaths']].sum()
q2_df.head()
# Now we want to merge the population and median income data with this.
q2_df = q2_df.merge(pop_inc_cc_codes)
print(q2_df.shape)
q2_df.head(1)
# Need to add the column New_Cases
# Calculate # of New_Cases/day within each country, so the difference never crosses country boundaries
q2_df.sort_values(['Alpha_3', 'Date_'], inplace=True)
q2_df['New_Cases'] = q2_df.groupby('Alpha_3')['Confirmed'].diff()
print(q2_df.shape)
q2_df.tail(5)
# Now convert New_Cases to a rate per million population
q2_df['New_Cases_e6'] = q2_df['New_Cases'] / (q2_df['Pop2020'] / 1000000)
q2_df.head(1)
q2_df.sort_values(['Date_', 'Alpha_3'], ignore_index=True, inplace=True)
print(q2_df.shape)
q2_df.tail()
# This is the df that we need for q4. Lets make a copy of it.
q4_df = q2_df.copy(deep=True)
# Need the date of the last row in the df
end = q2_df.iloc[-1,0]
end
# We want the date x days ago
days = 14
start = end - timedelta(days=days)
start
q2_df = q2_df[q2_df['Date_'] >= start]
q2_df.shape
We do not think the reporting is correct. We are restricting our analysis to countries that have a Transparency International rating >= 60. The USA has a rating of 69.
q2_df = q2_df[q2_df['TI_2019'] >= 60]
print(q2_df.shape)
q2_df.head()
# Get a list with no duplicates from Country_Region
countries = q2_df['Country_Region'].tolist()
countries = (set(countries))
countries = list(countries)
countries.sort()
print(len(countries))
print(countries)
# Did not run this one because the output was too small to be of use.
# I include it here because if I was reviewing this I would ask why I did not use FacetGrid.
# g = sns.FacetGrid(q2_df, col="Country_Region", col_wrap=4, height=2)
# g.map(sns.pointplot, 'Date_String', 'New_Cases', order=date_string, color=".3", ci=None);
We run the below regression plots at full size. At the small size (above), you lose the detail of whether or not the # of New_Cases is trending up or down.
# Run a regression line plot for each country, classifying it as flattening or still rising
flat = []
still_rising = []
for country in countries:
    # Just the country of interest
    recent = q2_df[q2_df['Country_Region'] == country]
    # Check and see if there is any data in this df
    if len(recent) == 0:
        print('This country is in the countries list but has no data associated with it.', country)
        continue
    # Crude rising/falling check: compare the mean of the first half to the mean of the second half
    first_half = len(recent) // 2
    if recent['New_Cases'].iloc[:first_half].mean() > recent['New_Cases'].iloc[first_half:].mean():
        flat.append(country)
    else:
        still_rising.append(country)
    # Plot
    regression_new_cases(recent, country, days)
    # Create legend
    plt.legend([f'{country} Has a Transparency International Score of {recent["TI_2019"].iloc[0]}'], loc='upper center')
    plt.savefig(r'pics_final/rising_falling.jpg')
    plt.show();
print('flat \n', sorted(flat))
print('\nstill_rising \n', sorted(still_rising))
The above graphs show the rising and falling regression lines for each country. NB: sometimes the crude calculation in the code is not as capable of recognizing the trend as the human eye. The flat list contains the countries where the daily number of New_Cases is flat to declining. Much better than I thought!
We do not cut off based on Median Per Capita Income for this list; the only cutoff is the Transparency International score. That is why we have roughly twice as many countries being considered.
Can I see a global geographic representation of infections?
Yes; however, we are only going to do this for the last date that we have data for. The Johns Hopkins dashboard at https://coronavirus.jhu.edu/map.html is an impressive interactive facility that can visualize data geographically at a much greater level of sophistication than what I will be doing here. The purpose here is to append demographic information (population and Median Per Capita Income) and an "honesty score" via the Transparency International rating, so that we can gain some confidence about the usefulness of this data.
# Identify shape file
shapefile = r'data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp'
#Read shapefile using Geopandas
gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]
#Rename columns.
gdf.columns = ['country', 'country_code', 'geometry']
gdf.head()
We can drop the row for ‘Antarctica’ as it unnecessarily occupies a large space in our map and is not required in our current analysis.
gdf[gdf['country'] == 'Antarctica']
# Drop the row corresponding to 'Antarctica' by name, which is safer than a positional index
gdf = gdf[gdf['country'] != 'Antarctica']
# groupby that creates a sum for each date for each country (Alpha_3).
q3_df = covid_pop_inc_df.groupby(['Date_', 'Alpha_3'], as_index=False)[['Confirmed', 'Deaths']].sum()
q3_df.head()
# Now we want to merge the population and median income data with this.
q3_df = q3_df.merge(pop_inc_cc_codes)
print(q3_df.shape)
q3_df.head(1)
# Restrict the analysis to just the most recent date for data
q3_df = q3_df[q3_df['Date_'] == end]
print(q3_df.shape)
q3_df.tail(1)
q3_df.describe()
q3_df.isnull().mean()
Above looks fine.
# Change Date_ to a string date. Bokeh does not like real dates (sigh)
q3_df['Date_'] = end.strftime('%Y-%m-%d')
q3_df.head(1)
# Add column for Deaths Per Million
q3_df['Deaths_e6'] = q3_df['Deaths'] / (q3_df['Pop2020'] / 1000000)
Need to normalize the data.
# Get object column names
cat_cols = q3_df.select_dtypes('object').columns
cat_cols
# Keep only the categorical columns
cat_cols_df = q3_df[cat_cols]
cat_cols_df.reset_index(drop=True, inplace=True)
print(cat_cols_df.shape)
cat_cols_df.head(1)
# Get numeric column names
num_cols = q3_df.select_dtypes([np.number]).columns
num_cols
# numpy array of just the numeric values
x = q3_df[num_cols].values
x.shape
# normalize the data
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Now a numeric df
num_cols_df = pd.DataFrame(x_scaled, columns=num_cols)
print(num_cols_df.shape)
num_cols_df.head(1)
# Concatenate the two dataframes.
new_df = pd.concat([cat_cols_df, num_cols_df], axis=1)
print(new_df.shape)
new_df.head(1)
new_df.isnull().mean()
new_df.describe()
Above all looks good!
# Merge dataframes with left join. Avoids missing countries.
merged = gdf.merge(new_df, left_on = 'country_code', right_on = 'Alpha_3', how='left')
# Replace NaN values with the string 'No data'. These are caused by the left join (obviously)
merged.fillna('No data', inplace = True)
print(merged.shape)
merged.head(5)
q3_df.head(1)
# Instantiate a dictionary with the numeric feature columns
feature_dict = dict(zip(num_cols, range(len(num_cols))))
feature_dict
# Now that we have the plotting in a function, let's run all of the numeric columns through it.
# Create a dictionary with the appropriate strings
feature_dict.update({num_cols[0]: ['Confirmed', 'Fewest Cases', 'Covid-19 Confirmed Cases By Country'],
num_cols[1]: ['Deaths', 'Fewest Deaths', 'Covid-19 Deaths By Country'],
num_cols[2]: ['MPC_Inc', 'Lowest', 'Median Per Capita Income By Country'],
num_cols[3]: ['Pop2020', 'Lowest', 'Population By Country'],
num_cols[4]: ['TI_2019', 'Least Honest', 'Transparency International Score By Country'],
num_cols[5]: ['Deaths_e6', 'Fewest Deaths', 'Covid-19 Deaths By Country Per Million Population']
})
# Prepare map for all features
for feature in num_cols:
plot_world_map(merged, feature_dict[feature])
The above images quickly tell you where the challenges are. Any country that shows color on the 'Covid-19 Deaths By Country Per Million Population' map is in trouble. The grey countries are not reporting data. If you look at the 'Transparency International Score By Country' map (right above that one), you can immediately see the issue with reporting: countries that are honest report their data and, as a result, report higher deaths. The vast majority of countries are either not reporting or are "not honest".
What is the projected global mortality by December 31, 2020?
The way we are going to answer this is to look at a correlation heat map, which will tell us which variables would be helpful for predicting deaths. Then we fit a simple regression line and extend it using trigonometry. Finally, I run an ARIMA model that predicts every day going forward to the end of 2020.
# Look at df just to remind me of what I have.
q4_df.head(1)
# Look at the correlations. Should be pretty good.
sns.heatmap(q4_df.corr(), annot=True, fmt=".2f");
The correlation between Confirmed and Deaths is very high. You can see New_Cases_e6 is also correlated (.54) with Deaths. There are obvious cases of multicollinearity among the features in this dataset (Confirmed, New_Cases, New_Cases_e6), and there is considerable uncertainty about the quality of the data being reported. We want to predict the total number of global deaths by the end of 2020. With a regression line we can do that quite simply using trigonometry. Let's do that below.
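As a side note, pairs with problematic multicollinearity can be flagged mechanically from the correlation matrix rather than by eye. This is a sketch on synthetic data; the 0.9 threshold and the high_corr_pairs helper are my own choices, not something used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

rng = np.random.default_rng(0)
confirmed = rng.integers(100, 1000, size=50).astype(float)
demo = pd.DataFrame({
    'Confirmed': confirmed,
    'Deaths': confirmed * 0.05 + rng.normal(0, 1, 50),  # nearly collinear with Confirmed
    'Noise': rng.normal(0, 1, 50),                      # unrelated column
})
print(high_corr_pairs(demo))
```

For a regression model one of each flagged pair would normally be dropped; for the trigonometric extrapolation below it is enough to know that Deaths alone carries the signal.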
We simply do not need all of the complexity of the q4_df dataset to answer this question. The global_df that we created earlier in the notebook works quite well for this. It is created via a groupby on Date_ with a sum, which yields the following very simple dataset after dropping a few columns. Remember, though, that the Confirmed and Deaths numbers are cumulative; Mortality_Rate is a simple division of those two columns.
print(global_df.shape)
global_df.tail(1)
We are going to find the angle (bottom left-hand corner of the triangle) and then calculate the opposite side. The adjacent side is the number of days between Feb 1 2020 and Dec 31 2020. To get the initial angle, we use the above regression line and create a right triangle to work from.
regression_deaths(global_df, 'Global')
# Estimate the total number of deaths by the end of the year
# Find the angle based on the above regression line.
angle = calc_angle(global_df, 230000)
# Time period
start = '2020-02-01'
end = '2020-12-31'
# Estimate year end deaths
deaths, days = calc_deaths(angle, start, end)
print(f'The number of expected deaths forecast to occcur from Covid-19 by {end} is: {place_value(deaths)}.')
print(f'This is over the next {days} days.')
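It is worth noting that the atan/tan round trip in calc_angle and calc_deaths cancels out, so the estimate is just a linear extrapolation of the regression slope. A minimal sketch of that equivalence (the b value here is an illustrative number of observed days, not the one computed from the dataframe index):

```python
import math

a = 230000          # height of the regression triangle (deaths axis), from the plot above
b = 90              # observed days (x axis / adjacent side) -- an illustrative value
future_days = 334   # days from 2020-02-01 to 2020-12-31

# calc_angle: the angle whose tangent is a/b
angle = math.degrees(math.atan(a / b))
# calc_deaths: tan(angle) recovers a/b, so this is simply future_days * a / b
deaths = int(future_days * math.tan(math.radians(angle)))
print(deaths)
```

In other words, the trigonometry is a geometric way of saying "deaths per day so far, times days remaining."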
I have a number of features that we could use to predict mortality; however, the obvious one is simply Deaths. ARIMA is well suited to univariate time series.
Create and summarize a stationary version of the time series.
# create a differenced series
def difference(dataset):
    """Return the series of first differences of the input"""
    diff = []
    for i in range(1, len(dataset)):
        diff.append(dataset[i] - dataset[i - 1])
    return pd.Series(diff)
# Load the data
series = global_df['Deaths']
series.head()
# Convert it to an array
X = series.values
X = X.astype('float32')
# difference data
stationary = difference(X)
stationary.index = series.index[1:]
# check if stationary
result = adfuller(stationary)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
# plot differenced data
stationary.plot()
plt.show();
That is not stationary, and that is not good. We can get the ARIMA grid search to find a differencing order that deals with this.
# save
stationary.to_csv(r'data/stationary.csv')
# ACF and PACF plots of the time series
series = global_df['Deaths']
plt.figure(figsize=(16,6))
plt.subplot(211)
plot_acf(series, ax=plt.gca())
plt.subplot(212)
plot_pacf(series, ax=plt.gca())
plt.show()
A good starting point is p = 7 and q = 1. This quick analysis suggests an ARIMA(7,1,1) on the raw data may be a good starting point. I ran it, and it did not converge. So, let's evaluate a series of ARIMA models with a try/except block to avoid "LinAlgError: SVD did not converge".
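For reference, the sample autocorrelations that plot_acf draws can be computed directly. A minimal numpy sketch on a synthetic AR(1) series (the acf helper below is my own illustration, not the statsmodels implementation):

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

# Synthetic AR(1) series with coefficient 0.8: the ACF should decay slowly from 1
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal()
r = acf(y, 5)
print(np.round(r, 2))
```

A slowly decaying ACF like this is exactly the pattern that points at differencing (d >= 1), while the lag where the PACF cuts off suggests the starting p.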
# grid search ARIMA parameters for a time series
def evaluate_arima_model(X, arima_order):
    """Evaluate an ARIMA model for a given order (p,d,q) and return RMSE"""
    # prepare training dataset
    X = X.astype('float32')
    train_size = int(len(X) * 0.50)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    # make predictions with walk-forward validation
    predictions = []
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        # fit model
        try:
            model_fit = model.fit(trend='nc', disp=0)
            yhat = model_fit.forecast()[0]
        except Exception:
            # Fall back to a naive forecast so test and predictions stay the same length
            yhat = history[-1]
        predictions.append(yhat)
        history.append(test[t])
    # calculate out of sample error
    rmse = sqrt(mean_squared_error(test, predictions))
    return rmse
def evaluate_models(dataset, p_values, d_values, q_values):
    """Evaluate combinations of p, d and q values for an ARIMA model."""
    dataset = dataset.astype('float32')
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p, d, q)
                try:
                    rmse = evaluate_arima_model(dataset, order)
                    if rmse < best_score:
                        best_score, best_cfg = rmse, order
                    print('ARIMA%s RMSE=%.3f' % (order, rmse))
                except Exception:
                    continue
    print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))
    return best_cfg
# load dataset
series = global_df['Deaths']
# evaluate parameters
p_values = range(0, 8)
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
best_cfg = evaluate_models(series.values, p_values, d_values, q_values)
As we can see from the above, many of these parameter combinations caused the model fit to fail.
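The walk-forward scheme used in evaluate_arima_model is independent of ARIMA. Here is a minimal sketch of the same loop with a naive persistence forecast standing in for the model (the helper name and the toy series are my own, for illustration only):

```python
from math import sqrt

def walk_forward_rmse(series, split=0.5):
    """Walk-forward validation with a naive persistence (last-value) forecast."""
    n = int(len(series) * split)
    train, test = series[:n], series[n:]
    history = list(train)
    preds = []
    for obs in test:
        preds.append(history[-1])  # forecast: repeat the most recent observation
        history.append(obs)        # then reveal the true value, as in the ARIMA loop
    return sqrt(sum((p - o) ** 2 for p, o in zip(preds, test)) / len(test))

print(walk_forward_rmse([1, 2, 3, 4, 5, 6]))  # 1.0
```

A persistence baseline like this is also a useful floor: an ARIMA order is only worth keeping if it beats it.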
# summarize residual errors for an ARIMA model
# load data
series = global_df['Deaths']
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = []
for i in range(len(test)):
    # predict
    model = ARIMA(history, order=best_cfg)
    model_fit = model.fit(trend='nc', disp=0)
    yhat = model_fit.forecast()[0]
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
# errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = pd.DataFrame(residuals)
print(residuals.describe())
# plot
plt.figure()
plt.subplot(211)
residuals.hist(ax=plt.gca())
plt.subplot(212)
residuals.plot(kind='kde', ax=plt.gca())
plt.show();
# Save mean for bias adjustment below
bias = residuals.describe()
bias = bias.iloc[1][0]
print('\nbias saved in bias for subsequent run is:', bias, '\n')
The histogram shows that the distribution is skewed and that the mean is non-zero. This is perhaps a sign that the predictions are biased. The density plot of the residual errors suggests a Gaussian-like distribution with a longer tail on one side. Let's see if correcting for the bias makes any real difference.
# summarize residual errors from bias corrected forecasts
# load data
series = global_df['Deaths']
# prepare data
X = series.values.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
# bias from the run above is reused here
for i in range(len(test)):
    # predict
    model = ARIMA(history, order=best_cfg)
    model_fit = model.fit(trend='nc', disp=0)
    yhat = bias + float(model_fit.forecast()[0])
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)
# summarize residual errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = pd.DataFrame(residuals)
print(residuals.describe())
# plot
# histogram
plt.figure()
plt.subplot(211)
residuals.hist(ax=plt.gca())
# density
plt.subplot(212)
residuals.plot(kind='kde', ax=plt.gca())
plt.show()
The bias correction had almost no effect on the RMSE. Considering that we will be looking at deaths in the six-to-seven-figure range, it is not surprising that this made so little difference.
The summary of the forecast residual errors shows that the mean was indeed moved very close to zero, and the density plot of the residual error shows a small shift towards zero. Let's take a look at test vs predictions.
# Assemble title
title = ('Model Prediction of Test vs Predictions For Covid-19 ({}) For {} Days').format('Global', len(test))
plt.title(title)
# Create x and y axis labels
plt.xlabel('Days')
plt.ylabel('Cumulative Deaths By Day')
# Create plot
plt.plot(test, linewidth=2, label='Test')
plt.plot(predictions, linewidth=2, label='Predictions')
plt.legend()
plt.savefig(r'pics_final/test_vs_prediction.png');
Wow, it is hard to believe that this is that accurate!
# save finalized model to file
def __getnewargs__(self):
    """Monkey patch around a pickling bug in the ARIMA class."""
    return ((self.endog), (self.k_lags, self.k_diff, self.k_ma))
ARIMA.__getnewargs__ = __getnewargs__
# load data
series = global_df['Deaths']
# prepare data
X = series.values.astype('float32')
# fit model
model = ARIMA(X, order=best_cfg)
model_fit = model.fit(trend='nc', disp=0)
# bias constant, carried over from the in-sample mean residual computed above
# save model
model_fit.save(r'data/model.pkl')
np.save(r'data/model_bias.npy', [bias])
# load finalized model and make a prediction
# Load model
model_fit = ARIMAResults.load(r'data/model.pkl')
bias = np.load(r'data/model_bias.npy')
# forecast() with no arguments predicts one step ahead, i.e. literally tomorrow
yhat = bias + float(model_fit.forecast()[0])
print('Predicted: %.0f' % yhat)
This proves the model was saved, loaded successfully, and performed a prediction. Let's forecast every day between now and December 31, 2020. We will also use ALL of the data, since we have now put the model into "production".
global_df.shape
def difference(dataset, interval=1):
    """Create a differenced series at the given interval."""
    diff = []
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return np.array(diff)
def inverse_difference(history, yhat, interval=1):
    """Invert a differenced value."""
    return yhat + history[-interval]
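The pair of helpers is meant to round-trip: a value differenced at some interval is recovered exactly by inverse_difference. A quick self-contained check (synthetic numbers and an interval of 3, chosen only for illustration):

```python
import numpy as np

def difference(dataset, interval=1):
    """Create a differenced series at the given interval."""
    return np.array([dataset[i] - dataset[i - interval]
                     for i in range(interval, len(dataset))])

def inverse_difference(history, yhat, interval=1):
    """Invert a differenced value."""
    return yhat + history[-interval]

data = [10, 20, 35, 50, 80, 120]
diffed = difference(data, interval=3)  # [40, 60, 85]

# Rebuild the original series from the first `interval` values plus the differences
history = list(data[:3])
for yhat in diffed:
    history.append(inverse_difference(history, yhat, interval=3))
print(history)  # [10, 20, 35, 50, 80, 120]
```

The forecasting cell below does the same inversion, except the differenced values come from the fitted ARIMA instead of the data.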
# load dataset
series = global_df['Deaths']
# This algorithm accommodates seasonal variation, which may be helpful if covid turns out to be seasonal.
# Flu season runs from November through March, with peaks in December and February,
# in the northern hemisphere; the opposite holds for the southern hemisphere.
# To get the maximum information into the model, we currently use a period of half the series length.
X = series.values
period = int(X.shape[0]/2)
differenced = difference(X, period)
# fit model
model = ARIMA(differenced, order=(best_cfg))
model_fit = model.fit(disp=0)
# forecast Dec 31 2020
start = global_df.index[-1]
end = pd.to_datetime('2021-01-01', infer_datetime_format=True)
steps = end - start
steps = steps.days
# Fit the out of sample
forecast = model_fit.forecast(steps=steps)[0]
# invert the differenced forecast to something usable
history = [x for x in X]
inverted_ = []
for day, yhat in enumerate(forecast):
    inverted = inverse_difference(history, yhat, period)
    history.append(inverted)
    inverted_.append(int(inverted))
# Create a df for reporting
# Create a date_range index
start = global_df.index[-1]
end = start + timedelta(days=steps-1)
# Create the df
forecast_df = pd.DataFrame({'Deaths': inverted_}, index=pd.date_range(start=start, end=end))
# Shift Deaths by one day to make it line up correctly with the date.
forecast_df['Deaths'] = forecast_df['Deaths'].shift(1)
forecast_df = forecast_df[1:]
forecast_df.tail(1)
That is a pretty big number. Let's check whether it is reasonable. We will look at global_df and calculate its daily percent increase, then look at forecast_df and see if it is comparable.
# Take a copy of global_df
global_df_c = global_df.copy(deep=True)
# Drop the extra columns in global_df_c compared to report_df
global_df_c = global_df_c[['Deaths']]
# Create the extra columns
global_df_c['Increase'] = global_df_c['Deaths'] - global_df_c['Deaths'].shift(1)
global_df_c['Percent_Increase'] = (global_df_c['Increase'] / global_df_c['Deaths']) * 100
global_df_c['Percent_Increase'] = round(global_df_c['Percent_Increase'], 2)
global_df_c['Deaths'] = global_df_c['Deaths'].astype(int)
# Look at it
global_df_c.tail()
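Note that Percent_Increase as computed above divides the daily increase by the current day's cumulative total, while pandas' built-in pct_change divides by the previous day's total, so the two differ slightly. A small comparison on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'Deaths': [100.0, 110.0, 132.0]})
increase = df['Deaths'] - df['Deaths'].shift(1)

# This notebook's convention: increase relative to TODAY's cumulative total
pct_today = (increase / df['Deaths']) * 100

# pandas' built-in convention: increase relative to YESTERDAY's total
pct_yesterday = df['Deaths'].pct_change() * 100

print(round(pct_today.iloc[1], 2), round(pct_yesterday.iloc[1], 2))  # 9.09 10.0
```

The gap between the two conventions shrinks as the cumulative total grows, so for the comparison of means below the distinction does not change the conclusion.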
# Look at the daily percent increase for the first 5 forecast days
forecast_df['Increase'] = forecast_df['Deaths'] - forecast_df['Deaths'].shift(1)
forecast_df['Percent_Increase'] = (forecast_df['Increase'] / forecast_df['Deaths']) * 100
forecast_df['Percent_Increase'] = round(forecast_df['Percent_Increase'] ,2)
forecast_df['Deaths'] = forecast_df['Deaths'].astype(int)
forecast_df.head()
print('These means all have to do with the day-to-day percentage increase.\n')
print(f'Last 14 days actual data mean is: {round(global_df_c.Percent_Increase.iloc[-14:].mean(), 2)}')
print(f'First 14 days forecast mean is: {round(forecast_df.Percent_Increase.iloc[:14].mean(), 2)}')
print(f'Actual data mean is: {round(global_df_c.Percent_Increase.mean(), 2)}')
print(f'Forecast mean is: {round(forecast_df.Percent_Increase.mean(), 2)}')
If anything, the ARIMA model is being MORE conservative than the data would suggest.
predicted = place_value(int(forecast_df["Deaths"].iloc[-1]))
forecast_date = forecast_df.index[-1].strftime('%Y-%m-%d')
print(f'The prediction is for {predicted} cumulative deaths to occur by {forecast_date}')
# Let's visualize that.
# Add a new column that is Deaths_e6 (deaths per million)
forecast_df['Deaths_e6'] = forecast_df['Deaths'] / 1000000
# Assemble title
start = forecast_df.index[0].strftime('%Y-%m-%d')
title = ('{} Forecast Cumulative Deaths In Millions For Covid-19 ({} to {})').format('Global', start, forecast_date)
plt.title(title)
# Create x and y axis labels
plt.xlabel('Date')
plt.ylabel('Cumulative Deaths In Millions')
# Create plot
plt.plot('Deaths_e6', data=forecast_df, linewidth=4, label='Deaths')
plt.legend()
plt.savefig(r'pics_final/arima_prediction.png');
ARIMA thinks this is going to keep on going, and I can certainly see why. There is nothing in the dataset that says it will do anything other than what it is doing here, which is heading up and to the right very rapidly.
We developed two models. The first was based on trigonometry and yielded an answer of approximately 817,000 deaths by the end of 2020. We also created a more sophisticated ARIMA model, which predicts several million people will not make it to 2021. The model was incredibly accurate in its test-vs-prediction scores. I think this is because the math of a pandemic is so simple that it is easy for the model to learn. Let's hope that neither model is correct. However, I have a very bad feeling.
We answered 4 questions in this notebook. Please click on these links to look at those answers.
Answer To Q1 · Answer To Q2 · Answer To Q3 · Answer To Q4
24 of 28 countries (that met the cutoffs) currently show a flattening of the curve. That is VERY encouraging.
It is surprising to see rich countries struggle with this virus. There is a saying that democracies get the type of leadership that they deserve. Hopefully this is a wake up call that electing incompetent politicians with massive character flaws to the highest office in the land can and does cause death. This time those deaths are being counted. They are not hidden in Syria, Afghanistan, etc. They are right in the home countries that have weak leadership.
Including the Median Per Capita Income and the Transparency International score was helpful in terms of creating cutoffs that made sense.
One sobering thought here is that there are 203 unique Alpha_3 codes in this dataset, corresponding to 203 Country_Regions. We are only looking at 28 of them, because of the concern over the honesty of the data. In other words, the challenge we are looking at here is only the tip of the iceberg.
Generally speaking, I believe the reporting in the Johns Hopkins data is extremely suspect, due to political and on-the-ground issues. Politicians do not want to be accused of doing less than a great job, especially when it is obvious they are doing anything but. Many countries simply do not have the infrastructure and systems to care for the sick and to report. People are dying and those deaths are not being reported; cases are happening where there is no testing. It is likely that the only reasonable data will come from the western democracies, and even that will still be grossly under-reported in terms of actual cases. However, the Transparency International score I have included gives me some confidence in the findings of this notebook.
Update the Johns Hopkins Covid-19 data https://github.com/CSSEGISandData/COVID-19 by refreshing your local copy of the GitHub repository. The ARIMA model has been hyperparameter tuned, stored on disk, and is ready to go. There is a forecasting module included for this dataset that can be run at any time. It remains to be seen whether or not its predictions will be accurate. For additional information please refer to the ReadMe.